
feat: add @agentic-db/documents-loader package and CLI docs command#37

Merged
pyramation merged 3 commits into main from feat/documents-loader
Apr 30, 2026


@pyramation (Contributor) commented Apr 30, 2026

Summary

Adds a new @agentic-db/documents-loader package for importing/exporting text-based files into the documents table, plus CLI commands (agentic-db docs import/export/list).

New package: packages/documents-loader/

  • Parser — reads files and extracts YAML frontmatter (title, tags, metadata) from .md/.mdx; all other formats are read as plain text
  • Scanner — walks directories, filters by supported extensions (.md, .mdx, .txt, .rst, .html, .xml, .json, .yaml, .yml, .csv, .tsv), ignores node_modules/.git/etc. Automatically parses .gitignore files (root + nested) to skip ignored paths
  • Gitignore parser — self-contained, zero-dependency implementation of the gitignore spec (globs, **, negation, directory-only, character classes, anchored patterns). Suitable for upstreaming to dev-utils
  • Importer — upserts documents via SDK, matched by repo_name + file_path. Last-write-wins conflict resolution. Supports dry-run, progress callbacks, tag merging, commit hash tracking
  • Exporter — fetches documents by repo_name, writes to disk preserving directory structure, optional frontmatter generation
  • SDK client adapter — duck-typed interface so it works with both the SDK's and CLI's generated ORM clients
  • 74 Jest tests covering parser, scanner, importer, exporter, gitignore parser, and full roundtrip scenarios using temp directories with generated files
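
As a rough illustration of the Scanner's filtering rule listed above, the sketch below decides whether a relative path would be picked up. The extension list is taken from this summary. The function name `shouldInclude`, the case-insensitive extension comparison, and the two-entry ignore list are assumptions made for illustration (the summary's "etc." is not expanded), and the .gitignore handling is left out entirely.

```typescript
// Extensions listed in the PR summary as supported by the scanner.
const SUPPORTED_EXTENSIONS = new Set([
  ".md", ".mdx", ".txt", ".rst", ".html", ".xml",
  ".json", ".yaml", ".yml", ".csv", ".tsv",
]);

// Assumption: only the two directory names spelled out in the summary.
const IGNORED_DIRS = new Set(["node_modules", ".git"]);

// Returns the lower-cased extension (with dot), or "" for dotfiles and
// names without an extension.
function extensionOf(relPath: string): string {
  const base = relPath.split("/").pop() ?? "";
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot).toLowerCase();
}

// Hypothetical helper: true when a POSIX-style relative path has a
// supported extension and no ignored directory among its parents.
function shouldInclude(relPath: string): boolean {
  const parents = relPath.split("/").slice(0, -1);
  if (parents.some((dir) => IGNORED_DIRS.has(dir))) return false;
  return SUPPORTED_EXTENSIONS.has(extensionOf(relPath));
}
```

With this sketch, `docs/guide.md` is included, while `node_modules/pkg/README.md` and `assets/logo.png` are skipped.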

CLI expansion (sdk/cli/)

  • New docs command with import, export, and list subcommands
  • Wired into the existing custom commands pattern alongside search, ask, embed, config
  • Interactive prompts via inquirerer when args are omitted

CI

  • Added documents-loader-tests job (no database required)
  • Added Build documents-loader step before cli-e2e-tests

Embeddings: handled automatically — the DB's existing triggers set embedding_stale = true on create/update, and the worker picks it up.

Review & Testing Checklist for Human

  • Run cd packages/documents-loader && pnpm test — all 74 tests should pass
  • Review the gitignore parser (src/gitignore.ts) for potential upstream to dev-utils
  • Review the sdk-client.ts duck-typed interface — verify it matches the actual SDK API shape for document.findFirst/findMany/create/update/delete
  • Test agentic-db docs import ./some-dir --repo test-repo against a real database to confirm end-to-end flow
  • Verify frontmatter roundtrip: import a .md with frontmatter, export it, check content is preserved
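
For the sdk-client.ts review item, one way to picture a duck-typed client is a structural interface plus a runtime guard, sketched below. Only the five method names come from the checklist above. The interface names, the `unknown` argument and return types, and the `isClientLike` guard are hypothetical and are not taken from the package.

```typescript
// Hypothetical structural interface. Method names come from the PR
// checklist; argument and return shapes are deliberately left open.
interface DocumentModelLike {
  findFirst(args: unknown): unknown;
  findMany(args: unknown): unknown;
  create(args: unknown): unknown;
  update(args: unknown): unknown;
  delete(args: unknown): unknown;
}

interface ClientLike {
  document: DocumentModelLike;
}

const REQUIRED_METHODS = ["findFirst", "findMany", "create", "update", "delete"] as const;

// Runtime duck-type check: any object whose `document` property exposes
// the five methods is accepted, whichever generator produced it.
function isClientLike(value: unknown): value is ClientLike {
  if (typeof value !== "object" || value === null) return false;
  const doc = (value as { document?: unknown }).document;
  if (typeof doc !== "object" || doc === null) return false;
  return REQUIRED_METHODS.every(
    (name) => typeof (doc as Record<string, unknown>)[name] === "function",
  );
}
```

A guard like this only checks that the methods exist. Whether their argument shapes match the real SDK still needs the manual verification the checklist asks for.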

Notes

  • The package follows constructive-pnpm conventions: makage build, publishConfig.directory: "dist", workspace:* deps
  • ESLint in this repo uses .eslintrc.json (legacy) — individual package lint scripts need ESLINT_USE_FLAT_CONFIG=false due to ESLint v9
  • The docs CLI command is added to the custom commands map (renamed from ragCommands to customCommands to reflect the broader scope)
  • The gitignore parser is intentionally self-contained (no ignore npm dep) so it can be upstreamed to dev-utils later

Link to Devin session: https://app.devin.ai/sessions/e249b6a02652412c8484e5b00fc955dd
Requested by: @pyramation

- New package: @agentic-db/documents-loader for importing/exporting text-based
  files (md, mdx, txt, rst, html, yaml, json, csv, etc.) into the documents table
- Parser with frontmatter extraction for markdown/mdx files
- Directory scanner with configurable extension and ignore filters
- Importer with last-write-wins conflict resolution (upsert by repo_name + file_path)
- Exporter with optional frontmatter generation
- SDK client adapter using duck-typed interfaces for compatibility
- CLI: agentic-db docs import/export/list commands
- 52 tests covering parser, scanner, importer, exporter, and roundtrip scenarios
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@socket-security

socket-security Bot commented Apr 30, 2026

👍 No dependency changes detected in pull request

The CLI now depends on @agentic-db/documents-loader, which needs to be
built before the e2e tests can resolve it via tsx.
@devin-ai-integration

Test Results: @agentic-db/documents-loader

Testing approach: Shell-based testing against built package with temp directories and mock clients (no DB credentials available).

All 8 tests passed (86 total assertions)
| # | Test | Result |
|---|------|--------|
| 1 | Unit test suite (52 Jest tests) | ✅ Passed |
| 2 | Build produces valid CJS + ESM + types | ✅ Passed |
| 3 | Adversarial parser edge cases (17 assertions) | ✅ Passed |
| 4 | Roundtrip integrity (10 assertions) | ✅ Passed |
| 5 | CLI docs import/export/list registered | ✅ Passed |
| 6 | TypeScript compilation (both packages) | ✅ Passed |
| 7 | Dry-run mode (3 assertions) | ✅ Passed |
| 8 | Last-write-wins conflict resolution (4 assertions) | ✅ Passed |

Adversarial edge cases tested
  • Empty file → title derived from filename, content=""
  • Frontmatter-only → title from frontmatter, content=""
  • No trailing newline → exact content match
  • Special chars (colons, quotes) in title → parsed correctly
  • Windows CRLF line endings → title and tags parsed
  • 5-level directory nesting → scanner preserves paths
  • Unicode filename (名前) → title and content correct
  • Binary .png → excluded by scanner (7 .md found, 0 .png)
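
To make several of the edge cases above concrete (empty file, frontmatter-only, CRLF, colons in a quoted title, Unicode filename), here is a deliberately simplified frontmatter sketch. It is not the package's parser: `parseMarkdown` and its helpers are hypothetical names, only `title:` and inline `tags: [a, b]` are understood rather than full YAML, and normalizing CRLF to LF in the returned content is an assumption.

```typescript
interface ParsedDocument {
  title: string;
  tags: string[];
  content: string;
}

// Trims a scalar and removes one pair of matching surrounding quotes.
function stripQuotes(value: string): string {
  const v = value.trim();
  const quoted =
    v.length >= 2 &&
    ((v.startsWith('"') && v.endsWith('"')) || (v.startsWith("'") && v.endsWith("'")));
  return quoted ? v.slice(1, -1) : v;
}

// Fallback title: the file name without its extension.
function titleFromFilename(filePath: string): string {
  const base = filePath.split("/").pop() ?? filePath;
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(0, dot) : base;
}

function parseMarkdown(filePath: string, raw: string): ParsedDocument {
  const text = raw.replace(/\r\n/g, "\n"); // assumption: normalize CRLF
  const doc: ParsedDocument = { title: titleFromFilename(filePath), tags: [], content: text };
  const match = /^---\n([\s\S]*?)\n---(?:\n|$)/.exec(text);
  if (!match) return doc; // no frontmatter: whole file is content
  doc.content = text.slice(match[0].length);
  const header = match[1] ?? "";
  for (const line of header.split("\n")) {
    const colon = line.indexOf(":"); // split on the first colon only
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim();
    const value = line.slice(colon + 1).trim();
    if (key === "title" && value) doc.title = stripQuotes(value);
    if (key === "tags") {
      doc.tags = value
        .replace(/^\[|\]$/g, "")
        .split(",")
        .map((tag) => stripQuotes(tag))
        .filter((tag) => tag.length > 0);
    }
  }
  return doc;
}
```

Splitting each header line on the first colon only is what lets a quoted title such as `"Intro: part 1"` survive intact.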
Roundtrip & conflict resolution
  • Roundtrip: Import 3 files (.md, .txt, .yaml) → export → .md has frontmatter, .txt/.yaml do not, all content preserved exactly
  • Dry-run: 3 files scanned, 0 created, mock store empty
  • Last-write-wins: File re-import overwrites DB edits (content="New content from file", not "DB edit by someone else")
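
The dry-run and last-write-wins behaviour reported above can be sketched against an in-memory stand-in for the documents table. Everything here is illustrative: `importDocuments`, `Store`, and `keyOf` are hypothetical names, and the code is kept synchronous for brevity, whereas the real importer goes through the SDK client.

```typescript
interface DocRecord {
  repo_name: string;
  file_path: string;
  content: string;
}

interface ImportResult {
  scanned: number;
  created: number;
  updated: number;
}

// Hypothetical in-memory stand-in for the documents table.
type Store = Map<string, DocRecord>;

// Documents are matched by repo_name + file_path, as in the PR summary.
const keyOf = (d: { repo_name: string; file_path: string }): string =>
  `${d.repo_name}\u0000${d.file_path}`;

function importDocuments(store: Store, docs: DocRecord[], dryRun = false): ImportResult {
  const result: ImportResult = { scanned: 0, created: 0, updated: 0 };
  for (const doc of docs) {
    result.scanned += 1;
    if (dryRun) continue; // dry-run: count the file, write nothing
    if (store.has(keyOf(doc))) result.updated += 1;
    else result.created += 1;
    // Last-write-wins: the incoming file replaces whatever is stored.
    store.set(keyOf(doc), { ...doc });
  }
  return result;
}
```

Because the incoming record always replaces the stored one, any edit made directly in the database is lost on the next import of the same file, which matches the "DB edit by someone else" result above.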
Not tested (no DB access)
  • End-to-end against real PostgreSQL
  • Embedding auto-trigger (requires DB triggers)
  • CLI interactive prompts (requires TTY)

Devin session

- Add self-contained gitignore parser (no external deps) with 22 tests
- Scanner now reads .gitignore files (root + nested) and skips ignored paths
- Add skipGitignore option to ScanOptions for opting out
- Export gitignore utilities from package index for potential upstream to dev-utils
- Add documents-loader-tests CI job (no database required)
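
For readers unfamiliar with gitignore matching, the sketch below shows the general idea of compiling patterns to regular expressions and letting the last matching rule win. It is not the code in src/gitignore.ts: `compileRule`, `parseGitignore`, and `isIgnored` are hypothetical names, and character classes and backslash escapes are left out to keep it short.

```typescript
interface Rule {
  regex: RegExp;
  negated: boolean;
  dirOnly: boolean;
}

function compileRule(line: string): Rule | null {
  let pattern = line.trim();
  if (pattern === "" || pattern.startsWith("#")) return null; // blank or comment
  const negated = pattern.startsWith("!");
  if (negated) pattern = pattern.slice(1);
  const dirOnly = pattern.endsWith("/");
  if (dirOnly) pattern = pattern.slice(0, -1);
  // A slash at the start or in the middle anchors the pattern to the root.
  const anchored = pattern.includes("/");
  if (pattern.startsWith("/")) pattern = pattern.slice(1);
  if (pattern === "") return null;

  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    if (pattern.startsWith("**/", i)) { source += "(?:.*/)?"; i += 2; } // zero or more dirs
    else if (pattern.startsWith("**", i)) { source += ".*"; i += 1; }   // anything below
    else if (ch === "*") source += "[^/]*";
    else if (ch === "?") source += "[^/]";
    else source += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex metachars
  }
  const prefix = anchored ? "^" : "^(?:.*/)?";
  return { regex: new RegExp(prefix + source + "$"), negated, dirOnly };
}

function parseGitignore(text: string): Rule[] {
  return text
    .split(/\r?\n/)
    .map((line) => compileRule(line))
    .filter((rule): rule is Rule => rule !== null);
}

// A path is ignored when it, or any of its parent directories, is ignored.
function isIgnored(rules: Rule[], relPath: string, isDir = false): boolean {
  const segments = relPath.split("/");
  for (let depth = 1; depth <= segments.length; depth++) {
    const candidate = segments.slice(0, depth).join("/");
    const candidateIsDir = depth < segments.length || isDir;
    let ignored = false;
    for (const rule of rules) {
      if (rule.dirOnly && !candidateIsDir) continue;
      if (rule.regex.test(candidate)) ignored = !rule.negated; // last match wins
    }
    if (ignored) return true;
  }
  return false;
}
```

Checking every parent prefix mirrors the gitignore rule that a file cannot be re-included once its containing directory is excluded.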
@pyramation pyramation merged commit 2e99530 into main Apr 30, 2026
9 checks passed